Abstract
In the rapidly evolving landscape of global e-commerce, footwear remains one of the most challenging categories due to high variance in sizing standards and the lack of tactile or spatial feedback for consumers. This research presents **SoleSense**, a multi-component intelligent platform designed to bridge the gap between digital representation and physical fit. SoleSense integrates three core innovations: an **Ensemble Deep Learning Model** for size prediction, a **Real-Time Augmented Reality (AR) Try-On** system, and a **Template-Based 3D Reconstruction** engine. The ensemble model uses three specialized ResNet18-based "experts" (Small, Medium, and Large size ranges) that work in tandem to predict European shoe sizes with higher precision than monolithic architectures. Experimental results on a dataset of 6,944 images demonstrate a Mean Absolute Error (MAE) of 6.22 EU across all ranges, with 90.1% accuracy (±1 size) in the Large specialist category. Complementing the predictive core is a MediaPipe-powered AR overlay that lets users visualize footwear in situ through their camera feed. Finally, the system applies 3D template deformation to reconstruct the user's foot mesh, providing a quantitative "Fit Score" for 3D shoe models. SoleSense is a comprehensive full-stack solution (Flask/Python/PyTorch/Three.js) that addresses the economic drain of high return rates in the footwear industry while enhancing the user experience through immersive technology. By acting as a virtual "Sizing Consultant," SoleSense moves the industry from a selection-based model to a curation-based one, supporting sustainable e-commerce growth. This research contributes to computer vision, deep learning, and human-computer interaction by demonstrating a scalable, high-fidelity approach to digital ergonomics.
Introduction
The digital transformation of retail has shifted consumer purchases from physical stores to online platforms, but the footwear sector faces unique challenges due to precise sizing requirements. E-commerce return rates for shoes can reach 30–40%, primarily due to incorrect fit, causing economic losses and environmental impact. To address this, SoleSense leverages computer vision, deep learning, and AR to provide accurate, personalized footwear sizing using standard consumer hardware.
Unlike early anthropometric or photo-based methods, SoleSense employs a hierarchical ensemble of CNN “Specialist” models to capture sub-millimeter foot morphology, integrating gender-specific features and nonlinear size mapping. It combines a deep learning Intelligence Layer with a 3D AR visualization layer, offering users a “Virtual Mirror” for fit validation and confidence in online purchases.
The system’s modular architecture includes layers for ingestion (foot landmark detection), intelligence (ensemble regression), visualization (AR and 3D reconstruction), application (Flask backend), and data (secure storage). Training uses a large, augmented dataset of diverse foot images, optimized via mean squared error, regularization, and early stopping. This integration of AI-driven measurement, AR visualization, and e-commerce infrastructure reduces returns, improves user experience, and promotes economic and environmental efficiency, representing a significant advance in smart footwear retail.
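The training regime described above (mean-squared-error objective, regularization, early stopping) can be sketched as a standard PyTorch loop. The optimizer choice and all hyperparameter values below are illustrative assumptions, not the reported configuration:

```python
# Sketch of a training loop with MSE loss, L2 regularization via
# weight decay, and patience-based early stopping. Hyperparameters
# are assumed values for illustration.
import torch
import torch.nn as nn


def train_specialist(model, train_loader, val_loader,
                     epochs=50, patience=5, lr=1e-4, weight_decay=1e-5):
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    best_val, stall = float("inf"), 0
    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    for _ in range(epochs):
        model.train()
        for images, sizes in train_loader:
            opt.zero_grad()
            loss_fn(model(images), sizes).backward()
            opt.step()
        # Validate once per epoch.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item()
                      for x, y in val_loader) / len(val_loader)
        if val < best_val:
            best_val, stall = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stall += 1
            if stall >= patience:  # early stopping: no recent improvement
                break
    model.load_state_dict(best_state)  # restore the best checkpoint
    return model, best_val
```

The early-stopping criterion here halts training after `patience` epochs without validation improvement, which is the usual way to trade training time against overfitting on an augmented dataset.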
Conclusion
The SoleSense project successfully demonstrates the feasibility of an intelligent, ensemble-based framework for footwear e-commerce. By integrating Deep Learning (ResNet18), Computer Vision (MediaPipe AR), and 3D Mesh Reconstruction (Three.js), we have created a unified ecosystem that addresses the entire customer journey—from initial sizing to immersive visualization and purchase. The research confirms that a hierarchical ensemble of specialists significantly outperforms monolithic architectures by capturing the non-linear nuances of human morphology. While challenges remain in the "Small" size specialists due to scale ambiguity, the overall system provides a robust, scalable, and secure solution for the modern retail landscape.
SoleSense stands as a testament to the power of "Intelligent Interaction," transforming a simple smartphone into a high-precision digital fitting room. By reducing footwear returns and increasing consumer confidence, this technology contributes to both economic efficiency and environmental sustainability. Future work will focus on integrating gait analysis to provide "Comfort Metrics" and on porting the PyTorch models to mobile-friendly formats such as ONNX for offline, client-side inference. The roadmap also includes 3D scanning from multi-angle side profiles to calculate arch height more accurately. Ultimately, SoleSense paves the way for a more efficient and confident future in global footwear trade, in which digital products reliably match the fit of their physical counterparts. This research concludes that the convergence of deep learning and computer vision offers a viable path toward a low-return, immersive digital retail experience.